How to Build a Fast, General-Purpose AI "Storytelling" Project Without Training an RNN or a Generative Model


Andre Ye CSDN 2020-10-29

Author | Andre Ye

Translator | 弯月; Editor | 郭芮

Header image | CSDN, downloaded from Visual China Group

Published by | CSDN (ID: CSDNnews)

The translated article follows:

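These days we are all in quarantine, and we crave stories more than ever. But not every story interests every one of us: people who love romance may have no patience for mysteries, and fans of detective fiction may care little for love stories. Look around: who is better placed than an AI to tell us the stories we actually like?

In this article, I will walk you through building an AI that tells us stories according to our personal tastes, to brighten up a dull quarantine.

The article is divided into the following parts:

1. Blueprint: an overview of the whole project and its components.
2. System demo: a preview of what the system does once the code is written.
3. Data loading and cleaning: loading the data and preparing it for processing.
4. Finding the most representative plots: the first part of the project, which uses K-Means to choose the plots that reveal the most about a user's interests.
5. Summarization: graph-based summarization produces a summary of every plot, which is a building block of the UI.
6. Recommendation engine: a simple predictive machine learning model recommends new stories.
7. Putting all the components together: the surrounding structure that ties every component into one system.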

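Blueprint

I want an AI to tell me a story. Ideally, in true techno-Renaissance fashion, I would train a recurrent neural network or some other generative model. In my experience with text generation, however, such models either take a very long time to train or overfit the data, which defeats the goal of generating original text. On top of that, training a well-performing model takes more than 8 hours, yet, as far as I know, Kaggle, the most capable free platform for training deep learning models, lets a free session run for at most 8 hours.

I wanted a project that is fast, general-purpose, and within anyone's reach. Instead of training an RNN or a generative model, this AI simply searches a "story database" of human-written stories and finds the ones I will like best. That guarantees a baseline of quality (stories made by humans, for humans), and it is faster too.

For the story database, we will use the Wikipedia Movie Plots dataset on Kaggle. It contains about 35,000 movie plots across many genres, countries, and eras, which makes it the best story database I could find.

The dataset includes each movie's release year, title, origin, genre, and a plain-text description of the plot.

With the data in hand, let's sketch a rough outline, or blueprint.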

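1. The program outputs short summaries of five highly distinctive stories. (Ratings of such stories separate users' tastes well. A story like The Godfather, by contrast, tells us almost nothing about anyone's taste, because everyone likes it.)

2. The user rates each one: like, dislike, or neutral.

3. The program takes the user's ratings of these five stories and then outputs summaries of complete stories. If the user is interested in one, the program prints the full story. After every full story, it asks the user for feedback. The program learns from this real-time feedback and tries to make better recommendations (a kind of reinforcement learning system).

Note that we pick only five or so of the most representative stories so that the model gets as much information as possible from a limited amount of data.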

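System Demo

At the start, the program asks for your feedback on three stories. As far as the program is concerned, these are the most representative stories of each cluster in the data.

Once you have answered these three onboarding questions and the model has a rough estimate of your tastes, it starts serving up stories you are likely to enjoy.

If a story's snippet catches your interest, the program prints the whole story for you to read.

The model adds your feedback (whether or not you liked the story) to its training data to improve its recommendations, so it keeps learning as you read. If you do not like a story's summary, the program skips the full story and moves on to a new one.

If you like the snippet of a story about murder and police and respond with "1", the program learns from that and recommends more and more stories in that direction.

Much like Monte Carlo tree search, the program moves in the direction that maximizes reward and backs off when it strays too far from the kinds of stories you like, optimizing your experience.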

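Data Loading and Cleaning

We load the data with pandas' read_csv.

import pandas as pd

data = pd.read_csv('/kaggle/input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')
data.head()

The dataset's fields are the release year, the movie title, the origin, the director, the cast, the genre, the URL of the movie's Wikipedia page, and a plain-text description of the plot. We can drop the director and the cast: with 12,593 distinct directors and 32,182 distinct cast entries, to be precise, these two fields have too many categories to be of much use to our recommendation algorithm or clustering. The genre, by contrast, has relatively few categories: only 30 or so genres contain more than 100 movies each, and together they cover over 80% of the movies (the rest can simply be lumped into "other"). So we drop only the director and the cast.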

data.drop(['Director','Cast'],axis=1,inplace=True)

Another problem is bracketed citations. As you know, Wikipedia numbers its citation sources (for example, [3]).

"Grace Roberts (played by Lea Leland), marries rancher Edward Smith, who is revealed to be a neglectful, vice-ridden spouse. They have a daughter, Vivian. Dr. Franklin (Leonid Samoloff) whisks Grace away from this unhappy life, and they move to New York under aliases, pretending to be married (since surely Smith would not agree to a divorce). Grace and Franklin have a son, Walter (Milton S. Gould). Vivian gets sick, however, and Grace and Franklin return to save her. Somehow this reunion, as Smith had assumed Grace to be dead, causes the death of Franklin. This plot device frees Grace to return to her father's farm with both children.[1]"

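In the string above, for instance, we need to remove [1]. The simplest solution is to build a list of every bracketed value ([1], [2], [3], …, [98], [99]) and remove each of them from the string. This assumes that no article has more than 99 citations. It is not the most efficient approach, but it gets the job done without messy string indexing or splitting.

blacklist = []
for i in range(100):
    blacklist.append('[' + str(i) + ']')

This code builds blacklist, the list of citation markers we want to get rid of.

def remove_brackets(string):
    for item in blacklist:
        string = string.replace(item, '')
    return string

The function remove_brackets uses the blacklist to strip those markers; we then apply it to every entry in the Plot column.

data['Plot'] = data['Plot'].apply(remove_brackets)

That completes our basic data cleaning.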
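As an aside, the same cleanup can also be expressed as a single regular expression, which lifts the 99-citation assumption. The sketch below is an alternative to the blacklist, not the article's method; the name remove_citations is made up for illustration.

```python
import re

# Matches a Wikipedia-style numeric citation marker such as [1] or [104].
CITATION = re.compile(r"\[\d+\]")


def remove_citations(text: str) -> str:
    """Delete every [n] marker, however many digits n has."""
    return CITATION.sub("", text)


sample = "Grace returns to her father's farm with both children.[1] The film is lost.[104]"
cleaned = remove_citations(sample)
print(cleaned)
```

On the real data this would be applied the same way as remove_brackets, through data['Plot'].apply(...). Brackets that do not enclose digits are left untouched.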

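Summarizing the Plots

Summarizing the plots is a key element of the system. Full plots are usually too long to read at a glance, so a summary matters: it lets the user decide whether to read on.

We will use graph-based summarization, one of the most popular approaches to text summarization. It first builds a graph of document units (most other approaches use sentences as the basic unit) and then selects nodes using a version of PageRank suited to this setting. Google's original PageRank takes a similar graph-based approach to ranking web pages as nodes.

PageRank computes the "centrality" of each node in the graph, which is useful for measuring how much relevant information a sentence carries. The graph is built from bag-of-words features of the text units, with edges weighted by cosine similarity.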
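To make the idea concrete, here is a minimal sketch of graph-based sentence ranking: bag-of-words vectors, cosine-similarity edge weights, and a damped PageRank iteration. It is a simplified illustration under stated assumptions (a naive sentence splitter, a damping factor of 0.85, 50 iterations), not gensim's actual implementation, and every function name in it is hypothetical.

```python
import re

import numpy as np


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bag_of_words(sentences):
    # One row per sentence, one column per distinct lowercase word.
    tokens = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    vocab = {w: i for i, w in enumerate(sorted({w for t in tokens for w in t}))}
    counts = np.zeros((len(sentences), len(vocab)))
    for row, words in enumerate(tokens):
        for w in words:
            counts[row, vocab[w]] += 1
    return counts


def rank_sentences(text, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    counts = bag_of_words(sentences)
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = counts / norms
    weights = unit @ unit.T            # cosine similarity between sentences
    np.fill_diagonal(weights, 0.0)     # no self-loops
    out = weights.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0
    transition = weights / out         # each sentence shares its score among neighbours
    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * (transition.T @ scores)
    return sentences, scores


text = ("The giant lives in a castle above the clouds. "
        "Jack climbs the beanstalk to the castle of the giant. "
        "The giant chases Jack down the beanstalk. "
        "Nobody sleeps.")
sentences, scores = rank_sentences(text)
best = sentences[int(np.argmax(scores))]
print(best)
```

A summary is then just the top-scoring sentences in their original order. The last sentence shares no words with the others, so it receives only the baseline score and would be dropped first.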

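We will use the gensim library to summarize long texts. As in the earlier examples, the implementation is simple:

import gensim  # gensim.summarization was removed in gensim 4.0, so this needs gensim < 4.0

string = '''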

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called “iterations”, through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.

Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1. However, later versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1. Hence the initial value for each page in this example is 0.25.

The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links.

If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.

Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer all of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A. At the completion of this iteration, page A will have a PageRank of approximately 0.458.

In other words, the PageRank conferred by an outbound link is equal to the document’s own PageRank score divided by the number of outbound links L( ).

 In the general case, the PageRank value for any page u can be expressed as: i.e. the PageRank value for a page u is dependent on the PageRank values for each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v. The algorithm involves a damping factor for the calculation of the pagerank. It is like the income tax which the govt extracts from one despite paying him itself.

'''

print(gensim.summarization.summarize(string))

Output:

In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.

The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links. If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A.

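That is a pretty good summary (if you don't feel like reading the whole text). Graph-based summarization is one of the most effective approaches, and it is what we will use for our plots. Let's write a function, summary, that takes a text and returns its summary. It needs two guards:

If the text is shorter than 500 characters, return it unchanged; a summary would be too brief to be useful.
If the text consists of a single sentence, gensim cannot process it, because all it does is pick the important sentences of a text. We use a TextBlob object, whose .sentences attribute splits a text into sentences; if the first sentence equals the text itself, the text has only one sentence.

import gensim
from textblob import TextBlob

def summary(x):
    # Return short or single-sentence texts unchanged.
    if len(x) < 500 or str(TextBlob(x).sentences[0]) == x:
        return x
    else:
        return gensim.summarization.summarize(x)

data['Summary'] = data['Plot'].apply(summary)

If neither condition holds, the function returns the summary of the text. We then store the results in a new Summary column.

This takes a few hours to run. It only has to run once, though, and having the summaries ready saves time later.

Let's look at how a sample text from the dataset is handled: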

"The earliest known adaptation of the classic fairytale, this films shows Jack trading his cow for the beans, his mother forcing him to drop them in the front yard, and beig forced upstairs. As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk. In this version, Jack is the son of a deposed king. When Jack wakes up, he finds the beanstalk has grown and he climbs to the top where he enters the giant's home. The giant finds Jack, who narrowly escapes. The giant chases Jack down the bean stalk, but Jack is able to cut it down before the giant can get to safety. He falls and is killed as Jack celebrates. The fairy then reveals that Jack may return home as a prince."

Result:

'As he sleeps, Jack is visited by a fairy who shows him glimpses of what will await him when he ascends the bean stalk.'

This summary makes a great teaser! It is easy to read and gives you a good feel for the key sentences of the plot.

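Finding the Most Representative Plots

To find the most representative plots, we use K-Means to split the plot texts into a number of clusters. We then cluster the movies on their text cluster label together with their origin, genre, and age. The closer a movie lies to the center of its cluster, the better it represents that cluster. The reasoning behind this is:

Asking users whether they like the most representative movies gives the model the most information, making up for the fact that we start out knowing nothing about the user's tastes.

A movie's origin, genre, and age each capture an aspect of what the text conveys, which helps us find suitable recommendations quickly. In theory, the most "accurate" recommendation would turn every plot into a very, very long vector and look for similarity between the candidate's vector and those of the plots the user liked, but that would take a long time. So we represent each plot by its attributes instead.

Clustering the texts needs to be done only once. It gives us an extra feature for clustering the movies, and it also gives us movie attributes to draw on when we actually make recommendations.

Let's get started. First, we remove all punctuation and lowercase all the text. We can write a function clean() that does this with a regular expression.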

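import string
import re

def clean(text):
    # re.escape keeps characters such as [ and ] from being misread inside the character class
    return re.sub('[%s]' % re.escape(string.punctuation), '', text).lower()

We apply this function to every plot with pandas' apply().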

data['Cleaned'] = data['Plot'].apply(clean)

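Next, we turn the data into vectors using TF-IDF (term frequency–inverse document frequency). This helps separate important words from unimportant ones, which makes the texts easier to cluster: it emphasizes words that appear many times in one document but rarely in the corpus as a whole, and it plays down words that appear in every document.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=500)
X = vectorizer.fit_transform(data['Plot'])

We store this very sparse matrix in the variable X. Because K-Means is distance-based, it suffers from the curse of dimensionality, so we should do our best to reduce the dimensionality of the vectorized text; here we cap the number of elements in each vector at 500. (When I did not set max_features, K-Means put every text but one into a single cluster and that one remaining text into another. That is the curse of dimensionality at work: with hundreds of thousands of dimensions in the TF-IDF vocabulary, distances become meaningless, and everything except the outliers ends up in the same cluster.)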
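A toy corpus shows both effects described above: a word unique to one document outweighs a word shared by all of them, and max_features caps the width of the matrix. The three sentences are invented for illustration, and stop words are deliberately left in so that the corpus stays tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the detective finds the killer",
    "the detective loves the singer",
    "the detective chases the robot",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_           # maps word -> column index

first_doc = X[0].toarray().ravel()
killer = first_doc[vocab["killer"]]        # appears in one document only
detective = first_doc[vocab["detective"]]  # appears in every document
print(killer > detective)

# max_features keeps only the most frequent terms across the corpus.
capped = TfidfVectorizer(max_features=3).fit_transform(docs)
print(X.shape, capped.shape)
```

Both words occur once in the first document, so the difference in weight comes entirely from the inverse document frequency.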

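For the same reason, it is a good idea to scale the data before feeding it to K-Means. We use StandardScaler, which standardizes each feature to zero mean and unit variance.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# StandardScaler cannot center a sparse matrix, so convert X to a dense array first
X = scaler.fit_transform(X.toarray())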

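Now let's train the K-Means model. Ideally, the number of clusters (which is also the number of questions we have to ask) should be between 3 and 6 inclusive.

We therefore run K-Means with each cluster count in the list [3, 4, 5, 6], score each one, and pick the count that suits our data best.

First, we initialize two lists to hold the cluster counts and the scores (the x and y of the plot):

n_clusters = []
scores = []

Next, we import KMeans and silhouette_score from sklearn.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Then, for each of the four preselected cluster counts, we fit a KMeans model with n clusters and append its silhouette score to the list.

for n in [3, 4, 5, 6]:
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(X)
    scores.append(silhouette_score(X, kmeans.predict(X)))
    n_clusters.append(n)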

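From here, I just hit "Commit" on Kaggle and let the program run on its own; it takes a few hours to finish.

The result: three clusters performed best, with the highest silhouette score.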
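The selection procedure is easier to see on data where the right answer is known. The sketch below runs the same loop on three synthetic, well-separated blobs; the data, the seeds, and the candidate list are assumptions made for the example and have nothing to do with the movie dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers])

scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Since the full run takes hours, it may be worth knowing that silhouette_score accepts a sample_size argument that scores a random subset of the rows; whether that approximation is acceptable here is a judgment call.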

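Now that we have the text labels, we can start clustering the movies as a whole. First, though, we need a few more steps to clean up the data.

Release Year, for example, starts around 1900. Fed in as raw integers, the values would confuse the model. Instead we create an Age column holding each movie's age, which is simply 2017 (the year of the most recent movie in the dataset) minus the release year.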

data['Age'] = data['Release Year'].apply(lambda x:2017-x)

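An Age that starts at 0 has a real-world meaning.

The Origin/Ethnicity column matters, since a story's style can often be traced back to where it comes from. The column is categorical, however, taking values such as ['American', 'Telugu', 'Chinese']. To make it machine-readable we one-hot encode it with sklearn's OneHotEncoder.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
nation = enc.fit_transform(np.array(data['Origin/Ethnicity']).reshape(-1, 1)).toarray()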

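nation now holds the one-hot encoding of every row. Each column index stands for one unique value; the first column (the first index of each row), for example, stands for 'American'.

For now, though, it is just an array, so we need to create columns in data to actually carry this information over. We name each new column after the origin that the corresponding column of the array represents (enc.categories_[0] returns the array of original categories, and nation[:,i] selects the i-th value of every row).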

for i in range(len(nation[0])): data[enc.categories_[0][i]] = nation[:,i]
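Here is the same encoding on a four-row toy frame, which makes it easy to check that categories_ and the encoded columns line up. The toy values are invented; only the column name Origin/Ethnicity comes from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"Origin/Ethnicity": ["American", "Telugu", "Chinese", "American"]})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(toy[["Origin/Ethnicity"]]).toarray()

# categories_[0] lists the categories in the same order as the encoded columns.
for i, name in enumerate(enc.categories_[0]):
    toy[name] = encoded[:, i]

print(toy)
```

Each row now has exactly one 1.0 among the new columns, and the categories come out in sorted order.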

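We have now added each story's origin to our data. Next, we do the same for genre. Genre matters even more than origin, because it carries information about the content of the story that a machine learning model cannot easily pick up on its own.

There is a problem, however:

data['Genre'].value_counts()

It looks like a lot of genres are unknown. Don't worry, we will deal with that later. For now, our goal is to one-hot encode the genres. We follow the same procedure as above with one change: many genres count as distinct only because their names differ (for example, "drama comedy" and "romantic comedy") even though they are essentially the same genre, so we keep only the 20 most popular genres and fold everything else into one of those 20.

# the 21 most frequent values in descending order; 'unknown' is one of them
top_genres = data['Genre'].value_counts().head(21).index.tolist()
top_genres.remove('unknown')

Note that we remove 'unknown' from the list at the end, which is why we started with 21 genres. Next, we process the genres against top_genres: any genre that is not among the 20 most popular is replaced with the string 'unknown'.

def process(genre):
    if genre in top_genres:
        return genre
    else:
        return 'unknown'

data['Genre'] = data['Genre'].apply(process)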
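The same "keep the top genres, mark the rest unknown" step can be sketched on a small Series. The genre values and the cutoff of two are made up for the example; Series.where is used here as a vectorized stand-in for the apply call above.

```python
import pandas as pd

genres = pd.Series([
    "drama", "drama", "comedy", "unknown", "unknown",
    "unknown", "film noir", "comedy", "drama", "space opera",
])

# Most frequent genres first, skipping the placeholder, then keep the top two.
top = [g for g in genres.value_counts().index if g != "unknown"][:2]

# Keep a value if it is a top genre, otherwise replace it with 'unknown'.
collapsed = genres.where(genres.isin(top), "unknown")
print(top)
print(collapsed.value_counts())
```

Filtering out the placeholder before slicing avoids having to ask for one extra entry and remove it afterwards.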

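Then, as above, we create a one-hot encoder instance and store the transformed result in the variable genres.

enc1 = OneHotEncoder(handle_unknown='ignore')
genres = enc1.fit_transform(np.array(data['Genre']).reshape(-1, 1)).toarray()

To merge this array into the data, we again create one column for each column of the array.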

for i in range(len(genres[0])): data[enc1.categories_[0][i]] = genres[:,i]

Our data is now one-hot encoded, but the unknown values are still a problem. Every row with a 1 in the unknown column needs a genre assigned to it. So, for each such row, we replace its genre values with NaN, so that the KNN imputer we use later recognizes them as missing.

genre_columns = ['action', 'adventure', 'animation', 'comedy', 'comedy, drama', 'crime',
                 'crime drama', 'drama', 'film noir', 'horror', 'musical', 'mystery',
                 'romance', 'romantic comedy', 'sci-fi', 'science fiction', 'thriller',
                 'unknown', 'war', 'western']

# blank out every genre column for the rows whose genre is unknown
unknown_rows = data['unknown'] == 1
data.loc[unknown_rows, genre_columns] = np.nan

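Now that every missing value is marked as missing, we can use the KNN imputer. Apart from the age and the origin, though, we do not have much data to base the imputation on. So we use TF-IDF to pick the top 30 words from the plots as extra information to help KNN assign genres correctly.

The text must be cleaned first, so we use a regular expression to remove all punctuation and convert all the text to lowercase.

import re

data['Cleaned'] = data['Plot'].apply(lambda x: re.sub('[^A-Za-z0-9]+', ' ', str(x)).lower())

We use the standard English stop words and set the maximum number of features to 30. The cleaned, vectorized text is stored as an array in the variable X.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=30)
X = vectorizer.fit_transform(data['Cleaned']).toarray()

As before, we copy each column of the array X into a column of our data, naming each column after the word it corresponds to in X.

# vocabulary_ maps word -> column index; sort by that index so the names line up with the columns of X
keys = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for i in range(len(keys)):
    data[keys[i]] = X[:, i]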

These words supply extra context that helps assign the genres. Finally, let's impute the genres!

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)

column_list = ['Age', 'American', 'Assamese','Australian', 'Bangladeshi', 'Bengali', 'Bollywood', 'British','Canadian', 'Chinese', 'Egyptian', 'Filipino', 'Hong Kong', 'Japanese','Kannada', 'Malayalam', 'Malaysian', 'Maldivian', 'Marathi', 'Punjabi','Russian', 'South_Korean', 'Tamil', 'Telugu', 'Turkish','man', 'night', 'gets', 'film', 'house', 'takes', 'mother', 'son','finds', 'home', 'killed', 'tries', 'later', 'daughter', 'family','life', 'wife', 'new', 'away', 'time', 'police', 'father', 'friend','day', 'help', 'goes', 'love', 'tells', 'death', 'money', 'action', 'adventure', 'animation', 'comedy', 'comedy, drama', 'crime','crime drama', 'drama', 'film noir', 'horror', 'musical', 'mystery','romance', 'romantic comedy', 'sci-fi', 'science fiction', 'thriller','war', 'western']

imputed = imputer.fit_transform(data[column_list])

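The imputer recognizes the missing np.nan values and automatically estimates the genre from nearby rows, using the origin data, the words in the plot, and the movie's age. The result is stored as an array. As usual, we copy it back into our data: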

for i in range(len(column_list)): data[column_list[i]] = imputed[:,i]
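A five-row toy matrix shows what the imputer does with a row whose genre columns are missing: it averages those columns over the nearest rows, judged on the features that are present. The column meanings and values are invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# columns: age, is_western, is_comedy
rows = np.array([
    [10.0, 1.0, 0.0],
    [11.0, 1.0, 0.0],
    [12.0, 0.0, 1.0],
    [50.0, 0.0, 1.0],
    [11.5, np.nan, np.nan],   # genre unknown, only the age is available
])

imputed = KNNImputer(n_neighbors=3).fit_transform(rows)
print(imputed[4])
```

The three rows closest in age are two westerns and one comedy, so the unknown row comes out as roughly two-thirds western and one-third comedy. That is where the fractional genre values in the imputed data come from.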

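After dropping the columns that have been one-hot encoded or are no longer needed, such as the unknown genre column and the categorical Genre variable…

# Director and Cast were already dropped earlier; the one-hot column is named 'unknown' in lowercase
data.drop(['Title','Release Year','Wiki Page','Origin/Ethnicity','unknown','Genre'],axis=1,inplace=True)

…the data is ready, with no missing values. Another interesting aspect of KNN imputation is that it can produce fractional values: a movie can be, say, 20% western and the rest one or more other genres.

All of these features are well suited to clustering. Combined with the cluster labels obtained earlier, they should be a good indicator of how much a user will like a given story. Finally, we cluster: as before, we split the stories into 3, 4, 5, or 6 clusters and see which performs best.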

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Xcluster = data.drop(['Plot','Summary','Cleaned'],axis=1)
scores = []
for i in [3,4,5,6]:
    kmeans = KMeans(n_clusters=i)
    prediction = kmeans.fit_predict(Xcluster)
    # append to the list directly instead of overwriting it with a single value
    scores.append(silhouette_score(Xcluster,prediction))

Plotting the scores…

As before, three clusters perform best, with the highest score. So we train KMeans with three clusters only:

from sklearn.cluster import KMeans

Xcluster = data.drop(['Plot','Summary','Cleaned'],axis=1)
kmeans = KMeans(n_clusters=3)
kmeans.fit(Xcluster)
pd.Series(kmeans.predict(Xcluster)).value_counts()

Ideally, each cluster holds a similar number of movies. We can get the cluster centers from the .cluster_centers_ attribute:

centers = kmeans.cluster_centers_
centers

First, we assign each item its label.

Xcluster['Label'] = kmeans.labels_

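For each cluster, we want the data point with the smallest Euclidean distance to the cluster center; that point represents the whole cluster best. The distance between two points p and q is the square root of the sum of squared differences across their dimensions:

d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}

Because the Euclidean distance is the L2 norm, it can be computed with numpy's linear algebra function np.linalg.norm(a-b).

Here is the full code that computes this and finds the story with the smallest Euclidean distance to each cluster center.

for cluster in [0,1,2]:
    subset = Xcluster[Xcluster['Label']==cluster]
    # reassign rather than drop in place, since subset is a slice of Xcluster
    subset = subset.drop(['Label'],axis=1)
    indexes = subset.index
    subset = subset.reset_index().drop('index',axis=1)
    center = centers[cluster]
    scores = {'Index':[],'Distance':[]}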

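This code sets up the search. First, we keep the stories whose label matches the cluster we are currently searching. Then we drop Label from the subset. To keep the original indices for later reference, we store them in the variable indexes. Next, we reset the index on the subset so that positional lookups work. Finally, we pick the current cluster's center and initialize a dictionary with two lists: one for the story's index in the main dataset, the other for its score/distance.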

    for index in range(len(subset)):
        scores['Index'].append(indexes[index])
        scores['Distance'].append(np.linalg.norm(center - np.array(subset.loc[index])))

This code iterates over every row of the subset, records the current index, and computes and records its distance from the center.

    scores = pd.DataFrame(scores)
    print('Cluster', cluster, ':', scores[scores['Distance']==scores['Distance'].min()]['Index'].tolist())

This code turns the scores into a pandas DataFrame for analysis and prints the index of the story closest to the center.
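The row-by-row loop above can also be written with NumPy broadcasting, which computes every distance in one call. This is a sketch on synthetic data, not a drop-in replacement: the feature matrix, the seeds, and the names own_centers and representatives are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
features = np.vstack([rng.normal(loc=c, scale=1.0, size=(40, 3)) for c in (0.0, 8.0, 16.0)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Look up each row's own cluster centre, then take the L2 norm of the difference.
own_centers = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(features - own_centers, axis=1)

representatives = {}
for cluster in range(3):
    members = np.flatnonzero(kmeans.labels_ == cluster)
    # Position (in the full matrix) of the member closest to its centre.
    representatives[cluster] = int(members[np.argmin(distances[members])])

print(representatives)
```

With a DataFrame, the positions in representatives would still have to be mapped back to the original index labels, which is what the indexes variable does in the loop version.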

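It appears that four stories tie for the smallest Euclidean distance in the first cluster, whereas clusters 1 and 2 each have just one.

Cluster 0:

data.loc[4114]['Summary']

Output: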

'On a neutral island in the Pacific called Shadow Island (above the island of Formosa), run by American gangster Lucky Kamber, both sides in World War II attempt to control the secret of element 722, which can be used to create synthetic aviation fuel.'

Cluster 1:

data.loc[15176]['Summary']

Output:

'Jake Rodgers (Cedric the Entertainer) wakes up near a dead body. Freaked out, he is picked up by Diane.'

Cluster 2:

data.loc[9761]['Summary']

Output:

'Jewel thief Jack Rhodes, a.k.a. "Jack of Diamonds", is masterminding a heist of $30 million worth of uncut gems. He also has his eye on lovely Gillian Bromley, who becomes a part of the gang he is forming to pull off the daring robbery. However, Chief Inspector Cyril Willis from Scotland Yard is blackmailing Gillian, threatening her with prosecution on another theft if she doesn\'t cooperate in helping him bag the elusive Rhodes, the last jewel in his crown before the Chief Inspector formally retires from duty.'

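Great! We now have the three most representative plots. A human may not see much difference between them, but to the machine learning model they carry a wealth of information that it can put to use right away.

Recommendation Engine

The recommendation engine here is simply a machine learning model that predicts which movie plots a user is more likely to rate highly. It takes in a movie's features, such as its age or origin, along with the TF-IDF vectorization of its summary, capped at 100 features.

The target for each plot is 1 or 0. Once trained on the data (the stories the user has already rated), the model predicts the probability that the user will rate a story favorably. It then recommends the story most likely to be liked, records the user's rating of it, and adds that story to the training data.

As training data, we use only the attributes stored for each movie.

A decision tree classifier is probably what we want: it makes effective predictions, trains quickly, and produces high-variance solutions, which is just what a recommender system is after.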
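A three-row example shows what such a model returns after only a handful of ratings. The features and ratings below are invented; the point is the shape of the output of predict_proba, not the particular values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# columns: age, is_crime, is_romance
X_train = np.array([
    [20, 1, 0],
    [35, 1, 0],
    [25, 0, 1],
])
y_train = [1, 1, 0]   # the user liked both crime stories and disliked the romance

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

candidates = np.array([
    [30, 1, 0],   # an unseen crime story
    [22, 0, 1],   # an unseen romance
])
# Column 1 of predict_proba is the probability of class 1 ("liked").
proba = model.predict_proba(candidates)[:, 1]
print(proba)
```

With so little data, a fully grown tree can only answer 0 or 1, so many candidates tie at the top early on. Note also that if every rating so far is the same class, predict_proba returns a single column and indexing column 1 fails, so a real loop may need to guard against that case.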

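Putting All the Components Together

First, we collect the user's ratings of the three most representative movies. The program makes sure that every input is either 0 or 1.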

import time

# the three representative summaries, in cluster order 0, 1, 2
snapshots = [
    'On a neutral island in the Pacific called Shadow Island (above the island of Formosa), run by American gangster Lucky Kamber, both sides in World War II attempt to control the secret of element 722, which can be used to create synthetic aviation fuel.',
    'Jake Rodgers (Cedric the Entertainer) wakes up near a dead body. Freaked out, he is picked up by Diane.',
    "Jewel thief Jack Rhodes, a.k.a. 'Jack of Diamonds', is masterminding a heist of $30 million worth of uncut gems. He also has his eye on lovely Gillian Bromley, who becomes a part of the gang he is forming to pull off the daring robbery. However, Chief Inspector Cyril Willis from Scotland Yard is blackmailing Gillian, threatening her with prosecution on another theft if she doesn't cooperate in helping him bag the elusive Rhodes, the last jewel in his crown before the Chief Inspector formally retires from duty.",
]

starting = []
print("Indicate if like (1) or dislike (0) the following three story snapshots.")
for number, snapshot in enumerate(snapshots, start=1):
    print('\n> > > %d < < <' % number)
    print(snapshot)
    time.sleep(0.5) #Kaggle sometimes has a glitch with inputs
    # keep asking until the answer is 0 or 1
    while True:
        response = input(':: ')
        try:
            if int(response) == 0 or int(response) == 1:
                starting.append(int(response))
                break
            else:
                print('Invalid input. Try again')
        except ValueError:
            print('Invalid input. Try again')

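The code above works well. Next, we store the data in the training DataFrame and then remove those rows from data.

# keep the rows in the order in which the ratings were collected (clusters 0, 1, 2)
X = data.loc[[4114,15176,9761]].drop(['Plot','Summary','Cleaned'],axis=1)
y = starting
data.drop([4114,15176,9761],inplace=True)

Now we build a loop. In it, we train a decision tree classifier on the current training set.

from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
from tqdm import tqdm

subset = data.drop(['Plot','Summary','Cleaned'],axis=1)
while True:
    dec = DecisionTreeClassifier().fit(X,y)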

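Then, for every index in the data, we predict a probability.

    # shuffle comes from sklearn.utils and tqdm from the tqdm package
    dic = {'Index':[],'Probability':[]}
    subdf = shuffle(subset).head(10_000) #select about 1/3 of data
    for index in tqdm(subdf.index.values):
        dic['Index'].append(index)
        dic['Probability'].append(dec.predict_proba(np.array(subdf.loc[index]).reshape(1, -1))[0][1])
    dic = pd.DataFrame(dic)

To keep selection fast, we shuffle the data and take the first 10,000 rows, roughly a third of the data. The code stores the indices and their probabilities in a DataFrame.

At first, many movies have a probability of 1, but as we go on and the model learns, it starts making more refined choices.

    # idxmax returns the row label of the first maximum, so this works whichever row scores highest
    index = dic.loc[dic['Probability'].idxmax(), 'Index']

We store the index of the movie the user is most likely to enjoy in the variable index.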

Next, we look up the information for that index in the data and display it.

    print('> > > Would you be interested in this snippet from a story? (1/0/-1 to quit) < < <')
    print(data.loc[index]['Summary'])
    time.sleep(0.5)

Then we validate that the user's input is 0, 1, or -1 (quit):

    while True:
        response = input(':: ')
        try:
            # -1 must be accepted here, otherwise the quit option can never be reached
            if int(response) in (-1, 0, 1):
                response = int(response)
                break
            else:
                print('Invalid input. Try again')
        except ValueError:
            print('Invalid input. Try again')

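…and we can start adding to the training data. First, though, we must let the user end the loop when they want to quit.

    if response == -1:
        break

Whether the user likes the movie or not, we add it to the training data (only the target differs):

    X = pd.concat([X,pd.DataFrame(data.loc[index].drop(['Plot','Summary','Cleaned'])).T])

Finally, if the response is 0, we append 0 to y, meaning the user does not want to hear this story.

    if response == 0:
        y.append(0)

If the user likes the snippet, the program prints the full story.

    else:
        print('\n> > > Printing full story. < < <')
        print(data.loc[index]['Plot'])
        time.sleep(2)
        print("\n> > > Did you enjoy this story? (1/0) < < <")

We collect the user's input once more and make sure it is 0 or 1.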
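The same validation loop appears several times in this walkthrough. One way to factor it out is a small helper like the one below. The name ask_choice and its read/write parameters are hypothetical; they exist so that the function can be exercised without a keyboard, and it behaves like the inline loops when called with the defaults.

```python
def ask_choice(allowed, read=input, write=print, prompt=":: "):
    """Keep asking until the reply parses as an integer contained in `allowed`."""
    while True:
        reply = read(prompt)
        try:
            value = int(reply)
        except ValueError:
            write("Invalid input. Try again")
            continue
        if value in allowed:
            return value
        write("Invalid input. Try again")


# Simulated session: two bad replies followed by a valid one.
replies = iter(["yes", "7", "1"])
messages = []
answer = ask_choice({0, 1}, read=lambda prompt: next(replies), write=messages.append)
print(answer, messages)
```

In the program itself, the calls would be ask_choice({0, 1}) for ratings and ask_choice({-1, 0, 1}) for the snippet prompt that also allows quitting.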